11.2.2 Exercises

1. What function would you use to read a file where fields were separated with “|”?

The read_delim() function with the argument delim = "|".

read_delim(file, delim = "|")

2. Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

read_csv() and read_tsv() have all of the same arguments as they are both special cases of read_delim(). These arguments include col_names, col_types, locale, na, quoted_na, quote, comment, trim_ws, skip, n_max, guess_max, and progress.
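For example, the shared na argument behaves identically in both readers (a minimal sketch; the inline strings are invented):

```r
library(readr)

# Treat "." as a missing value in both readers
read_csv("a,b\n1,.", na = ".")   # comma-separated
read_tsv("a\tb\n1\t.", na = ".") # tab-separated
```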

3. What are the most important arguments to read_fwf()?

The most important argument to read_fwf() is col_positions (defines where the data columns begin and end).
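As an illustration (the inline data and column widths below are invented), col_positions is usually built with helpers such as fwf_widths() or fwf_positions():

```r
library(readr)

# Two fixed-width columns: the first 3 characters, then the next 6
read_fwf("abc   123\ndef   456", fwf_widths(c(3, 6), c("x", "y")))
```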

4. Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?

"x,y\n1,'a,b'"
read_delim("x,y\n1,'a,b'", delim = ",",  quote = "'")
## # A tibble: 1 x 2
##       x y    
##   <int> <chr>
## 1     1 a,b

5. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

read_csv("a,b\n1,2,3\n4,5,6")
## Warning: 2 parsing failures.
## # A tibble: 2 x 5
##     row col   expected  actual    file        
##   <int> <chr> <chr>     <chr>     <chr>       
## 1     1 <NA>  2 columns 3 columns literal data
## 2     2 <NA>  2 columns 3 columns literal data
## # A tibble: 2 x 2
##       a     b
##   <int> <int>
## 1     1     2
## 2     4     5

Only 2 column headers are given (a and b), but there are 3 columns in the rows. When you run the code, the values in the third column are dropped.

read_csv("a,b,c\n1,2\n1,2,3,4")
## Warning: 2 parsing failures.
## # A tibble: 2 x 5
##     row col   expected  actual    file        
##   <int> <chr> <chr>     <chr>     <chr>       
## 1     1 <NA>  3 columns 2 columns literal data
## 2     2 <NA>  3 columns 4 columns literal data
## # A tibble: 2 x 3
##       a     b     c
##   <int> <int> <int>
## 1     1     2    NA
## 2     1     2     3

3 column headers are given (a, b, and c), but there are only 2 columns in the first row and there are 4 columns in the last row. When you run the code, there is a missing (NA) value in the third column of the first row and the value in the fourth column of the last row is dropped.

read_csv("a,b\n\"1")
## Warning: 2 parsing failures.
## # A tibble: 2 x 5
##     row col   expected                     actual    file        
##   <int> <chr> <chr>                        <chr>     <chr>       
## 1     1 a     closing quote at end of file ""        literal data
## 2     1 <NA>  2 columns                    1 columns literal data
## # A tibble: 1 x 2
##       a b    
##   <int> <chr>
## 1     1 <NA>

The 1 has an opening quote but no closing quote. When you run the code, read_csv() warns about the missing closing quote, drops the quote, and parses the 1 as an integer in column a; column b gets a missing (NA) value in the first row.

read_csv("a,b\n1,2\na,b")
## # A tibble: 2 x 2
##   a     b    
##   <chr> <chr>
## 1 1     2    
## 2 a     b

When you run the code, a and b are both treated as character vectors since the second row contains non-numeric strings.

read_csv("a;b\n1;3")
## # A tibble: 1 x 1
##   `a;b`
##   <chr>
## 1 1;3

When you run the code, a;b is treated as a single column header and 1;3 is treated as a single value within that column, since read_csv() looks for commas rather than semicolons as delimiters.

11.3.5 Exercises

1. What are the most important arguments to locale()?

encoding is the most important argument to locale(), since it determines how the bytes in a file are interpreted as characters when the file is read. decimal_mark and grouping_mark are also important, as they define the symbols used to mark the decimal place and to group the digits of large numbers.

2. What happens if you try and set decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to “,”? What happens to the default value of decimal_mark when you set the grouping_mark to “.”?

locale(decimal_mark = ".", grouping_mark = ".")
## Error: `decimal_mark` and `grouping_mark` must be different

If decimal_mark and grouping_mark are set to the same character, then you get an error.

locale(decimal_mark = ",")
## <locale>
## Numbers:  123.456,78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
##         Thursday (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May
##         (May), June (Jun), July (Jul), August (Aug), September
##         (Sep), October (Oct), November (Nov), December (Dec)
## AM/PM:  AM/PM

If decimal_mark is set to “,” then grouping_mark is set to “.”.

locale(grouping_mark = ".")
## <locale>
## Numbers:  123.456,78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed),
##         Thursday (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May
##         (May), June (Jun), July (Jul), August (Aug), September
##         (Sep), October (Oct), November (Nov), December (Dec)
## AM/PM:  AM/PM

If grouping_mark is set to “.” then decimal_mark is set to “,”.

3. I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.

date_format and time_format define the default date and time formats, which are used by functions such as parse_date() and parse_time().

Examples:

parse_date("01/02/15", locale = locale(date_format = "%d/%m/%y"))
## [1] "2015-02-01"
parse_time("02-00-08 am", locale = locale(time_format = "%M-%S-%I %p"))
## 08:02:00

4. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.

Not applicable.
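That said, a sketch of what such a locale might look like for, say, German-format files (all settings here are illustrative):

```r
library(readr)

# German-style files: comma decimals, dot grouping, day-first dates
de_locale <- locale(
  decimal_mark = ",",
  grouping_mark = ".",
  date_format = "%d.%m.%Y",
  tz = "Europe/Berlin",
  encoding = "ISO-8859-1"
)
parse_number("1.234,56", locale = de_locale)
```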

5. What’s the difference between read_csv() and read_csv2()?

read_csv() uses a comma as the delimiter, whereas read_csv2() uses a semi-colon as the delimiter.
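read_csv2() also treats the comma as the decimal mark, since it is intended for locales where the comma is reserved for decimals. A small sketch (inline data invented):

```r
library(readr)

# Semicolon-delimited fields; "1,5" parses as the number 1.5
read_csv2("a;b\n1,5;2")
```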

6. What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.

Europe: ISO 8859-1, ISO 8859-2, ISO 8859-3, etc…

Asia: Big5, GB 2312, EUC-KR, EUC-JP, etc…

7. Generate the correct format string to parse each of the following dates and times:

d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"

parse_date(d1, "%B %d, %Y")
## [1] "2010-01-01"
parse_date(d2, "%Y-%b-%d")
## [1] "2015-03-07"
parse_date(d3, "%d-%b-%Y")
## [1] "2017-06-06"
parse_date(d4, "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
parse_date(d5, "%m/%d/%y")
## [1] "2014-12-30"
parse_time(t1, "%H%M")
## 17:05:00
parse_time(t2, "%H:%M:%OS %p")
## 23:15:10.12

12.2.1 Exercises

1. Using prose, describe how the variables and observations are organised in each of the sample tables.

table1
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

In table1, each variable (country, year, cases, population) has its own column and each observation has its own row.

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

In table2, each row is defined by country, year, type (“cases” or “population”), and count.

table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

In table3, each row is defined by country, year, and rate (where rate is the number of cases divided by the population count as a character string).

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

Table 4 is spread across two tibbles, table4a for cases and table4b for population. Within each table, each row is defined by country and the number of cases or the population count for the year 1999 and the year 2000.

2. Compute the rate for table2, and table4a + table4b. You will need to perform four operations: 1) Extract the number of TB cases per country per year. 2) Extract the matching population per country per year. 3) Divide cases by population, and multiply by 10000. 4) Store back in the appropriate place. Which representation is easiest to work with? Which is hardest? Why?

For table2:

countries <- filter(table2, type == 'cases')$country
years <- filter(table2, type == 'cases')$year
count_cases <- filter(table2, type == 'cases')$count
count_population <- filter(table2, type == 'population')$count
rate <- count_cases / count_population * 10000
tibble(country = countries, year = years, rate = rate)
## # A tibble: 6 x 3
##   country      year  rate
##   <chr>       <int> <dbl>
## 1 Afghanistan  1999 0.373
## 2 Afghanistan  2000 1.29 
## 3 Brazil       1999 2.19 
## 4 Brazil       2000 4.61 
## 5 China        1999 1.67 
## 6 China        2000 1.67

For table4a + table4b:

countries <- table4a$country
cases_1999 <- table4a$`1999`
cases_2000 <- table4a$`2000`
population_1999 <- table4b$`1999`
population_2000 <- table4b$`2000`
rate_1999 <- tibble(country = countries, year = 1999, rate = cases_1999 / population_1999 * 10000)
rate_2000 <- tibble(country = countries, year = 2000, rate = cases_2000 / population_2000 * 10000)
(table4_rate <- rbind(rate_1999, rate_2000) %>% arrange(country))
## # A tibble: 6 x 3
##   country      year  rate
##   <chr>       <dbl> <dbl>
## 1 Afghanistan 1999. 0.373
## 2 Afghanistan 2000. 1.29 
## 3 Brazil      1999. 2.19 
## 4 Brazil      2000. 4.61 
## 5 China       1999. 1.67 
## 6 China       2000. 1.67

table2 is the easiest to work with since all of the data is in one table; the pair table4a and table4b is the hardest, because the cases and population counts live in separate tables and must be kept aligned by country and year before you can divide them.

3. Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?

First, you need to filter table2 so that it only contains the rows with type == "cases".

table2_cases <- filter(table2, type == "cases")
ggplot(data = table2_cases, aes(year, count)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(color = country)) 

12.3.3 Exercises

1. Why are gather() and spread() not perfectly symmetrical? Carefully consider the following example:

(stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
))
## # A tibble: 4 x 3
##    year  half return
##   <dbl> <dbl>  <dbl>
## 1 2015.    1.  1.88 
## 2 2015.    2.  0.590
## 3 2016.    1.  0.920
## 4 2016.    2.  0.170
stocks %>% 
  spread(year, return) %>% 
  gather("year", "return", `2015`:`2016`)
## # A tibble: 4 x 3
##    half year  return
##   <dbl> <chr>  <dbl>
## 1    1. 2015   1.88 
## 2    2. 2015   0.590
## 3    1. 2016   0.920
## 4    2. 2016   0.170

(Hint: look at the variable types and think about column names.) Both spread() and gather() have a convert argument. What does it do?

spread() and gather() are not perfectly symmetrical because information about column types is not transferred between them. In the example above, after running spread() and gather(), the year column changes from type dbl to type chr. This is because after running spread(), 2015 and 2016 become column names, which are then treated as strings when using gather().

2. Why does this code fail?

table4a %>% 
  gather(1999, 2000, key = "year", value = "cases")
## Error in inds_combine(.vars, ind_list): Position must be between 0 and n

This code fails because 1999 and 2000 need to be wrapped in backticks. Without them, gather() interprets them as the 1999th and 2000th columns of table4a rather than as column names.

table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

3. Why does spreading this tibble fail? How could you add a new column to fix the problem?

people <- tribble(
  ~name,             ~key,    ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

spread(people, key, value)
## Error: Duplicate identifiers for rows (1, 3)

Spreading this tibble fails because both the first and third rows contain a different value for “age” of “Phillip Woods”. A new column with the number of the observation could be added to fix the problem, as shown below.

people <- tribble(
  ~name,             ~key,    ~value, ~obs_num,
  #-----------------|--------|------|------
  "Phillip Woods",   "age",       45, 1,
  "Phillip Woods",   "height",   186, 1,
  "Phillip Woods",   "age",       50, 2,
  "Jessica Cordero", "age",       37, 1,
  "Jessica Cordero", "height",   156, 1
)

spread(people, key, value)
## # A tibble: 3 x 4
##   name            obs_num   age height
##   <chr>             <dbl> <dbl>  <dbl>
## 1 Jessica Cordero      1.   37.   156.
## 2 Phillip Woods        1.   45.   186.
## 3 Phillip Woods        2.   50.    NA

4. Tidy the simple tibble below. Do you need to spread or gather it? What are the variables?

preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

The tibble needs to be gathered. The variables are sex (male or female), pregnant (yes or no), and count.

preg %>%
  gather(male, female, key = "sex", value = "count", na.rm = TRUE)
## # A tibble: 3 x 3
##   pregnant sex    count
## * <chr>    <chr>  <dbl>
## 1 no       male     20.
## 2 yes      female   10.
## 3 no       female   12.

12.4.3 Exercises

1. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f    
## 3 h     i     j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     <NA> 
## 3 f     g     i

The extra argument in separate() controls what happens when there are too many pieces. The default behavior ("warn") is to emit a warning and drop the extra values. The alternatives are to drop the extra values without a warning ("drop") or to merge them into the final column ("merge").

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"), extra = "drop")
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f    
## 3 h     i     j
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"), extra = "merge")
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f,g  
## 3 h     i     j

The fill argument in separate() controls what happens when there are not enough pieces. The default behavior ("warn") is to emit a warning and fill from the right. The alternatives are to fill with missing values on the right ("right") or on the left ("left"), without a warning.

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = "right")
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     <NA> 
## 3 f     g     i
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = "left")
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 <NA>  d     e    
## 3 f     g     i

2. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

The remove argument in unite() and separate() controls whether the input column is removed from the output data frame; it defaults to TRUE. You would set it to FALSE if you want to create a new variable while keeping the old one.

table1 %>% 
  separate(year, into = c("century", "year_2"), sep = 2, remove = TRUE)
## # A tibble: 6 x 5
##   country     century year_2  cases population
##   <chr>       <chr>   <chr>   <int>      <int>
## 1 Afghanistan 19      99        745   19987071
## 2 Afghanistan 20      00       2666   20595360
## 3 Brazil      19      99      37737  172006362
## 4 Brazil      20      00      80488  174504898
## 5 China       19      99     212258 1272915272
## 6 China       20      00     213766 1280428583
table1 %>% 
  separate(year, into = c("century", "year_2"), sep = 2, remove = FALSE)
## # A tibble: 6 x 6
##   country      year century year_2  cases population
##   <chr>       <int> <chr>   <chr>   <int>      <int>
## 1 Afghanistan  1999 19      99        745   19987071
## 2 Afghanistan  2000 20      00       2666   20595360
## 3 Brazil       1999 19      99      37737  172006362
## 4 Brazil       2000 20      00      80488  174504898
## 5 China        1999 19      99     212258 1272915272
## 6 China        2000 20      00     213766 1280428583

With remove = TRUE, year is removed from the data frame; with remove = FALSE, year remains in the data frame.

3. Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite?

extract() uses a regular expression with capturing groups and turns each group into a new column. separate() uses either a regular expression or a vector of character positions to split a single character column into multiple columns. While there are several ways to separate a column (by position, by separator, or with groups), there is only one way to paste multiple columns together, so it makes sense that there are three variations of separation but only one unite().

12.5.1 Exercises

1. Compare and contrast the fill arguments to spread() and complete().

In spread(), the fill argument takes a single value, and that value replaces both explicit missing values (i.e. NA) and implicit ones (cells created for row/column combinations that were not present). In complete(), the fill argument takes a named list, so NAs in different variables can be replaced by different values.
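A minimal sketch of the difference (toy data invented):

```r
library(tidyverse)

df <- tibble(
  group = c("a", "a", "b"),
  item  = c("x", "y", "x"),
  value = c(1, 2, 3)
)

# group b has no "y" row: spread() creates that cell and fills it with 0
df %>% spread(item, value, fill = 0)

# complete() adds the missing (b, y) row; the named list says
# which value to use for each variable in the new row
df %>% complete(group, item, fill = list(value = 0))
```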

2. What does the direction argument to fill() do?

The direction argument to fill() (spelled .direction) specifies the direction in which to fill missing values. The default is "down" (each NA is replaced by the previous non-missing value), but it can also be set to "up".
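A quick sketch (toy data invented):

```r
library(tidyverse)

df <- tibble(x = c(1, NA, NA, 2, NA))

df %>% fill(x)                     # "down": 1, 1, 1, 2, 2
df %>% fill(x, .direction = "up")  # "up":   1, 2, 2, 2, NA
```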

12.6.1 Exercises

1. In this case study I set na.rm = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?

who %>%
  group_by(country) %>%
  summarize(year_min = min(year), year_max = max(year)) %>%
  ggplot() +
  geom_point(mapping = aes(x = year_min, y = country), color = 'red') +
  geom_point(mapping = aes(x = year_max , y= country), color = 'blue')

While most countries have data starting in 1980 and ending in 2013, there are implicit missing values, as shown by the plot above.

sum(who %>% select(-c(1:4)) == 0, na.rm = TRUE)
## [1] 11080
sum(who %>% select(-c(1:4)) %>% sapply(function(x){sum(is.na(x))}))
## [1] 329394

There are 11080 zero values (which indicate there were no cases of TB reported) and 329394 NA values. Setting na.rm = TRUE seems reasonable in this case.

2. What happens if you neglect the mutate() step? (mutate(key = stringr::str_replace(key, "newrel", "new_rel")))

who %>%
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE) %>%
  #mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  tail()
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows
## [73467, 73468, 73469, 73470, 73471, 73472, 73473, 73474, 73475, 73476,
## 73477, 73478, 73479, 73480, 73481, 73482, 73483, 73484, 73485, 73486, ...].
## # A tibble: 6 x 8
##   country                       iso2  iso3   year new   type  sexage cases
##   <chr>                         <chr> <chr> <int> <chr> <chr> <chr>  <int>
## 1 Venezuela (Bolivarian Republ… VE    VEN    2013 newr… f65   <NA>     402
## 2 Viet Nam                      VN    VNM    2013 newr… f65   <NA>    3110
## 3 Wallis and Futuna Islands     WF    WLF    2013 newr… f65   <NA>       2
## 4 Yemen                         YE    YEM    2013 newr… f65   <NA>     360
## 5 Zambia                        ZM    ZMB    2013 newr… f65   <NA>     669
## 6 Zimbabwe                      ZW    ZWE    2013 newr… f65   <NA>     725

If you neglect the mutate() step, then the sexage column will have a value of NA for keys that began with “newrel”.

3. I claimed that iso2 and iso3 were redundant with country. Confirm this claim.

select(who, country, iso2, iso3) %>%
  distinct() %>%
  group_by(country) %>%
  filter(n() > 1)
## # A tibble: 0 x 3
## # Groups:   country [0]
## # ... with 3 variables: country <chr>, iso2 <chr>, iso3 <chr>

The query returns zero rows: no country is associated with more than one distinct (iso2, iso3) pair, which confirms that iso2 and iso3 are redundant with country.

4. For each country, year, and sex compute the total number of cases of TB. Make an informative visualisation of the data.

who %>%
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE) %>%
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  select(-new, -iso2, -iso3) %>%
  separate(sexage, c("sex", "age"), sep = 1) %>%
  group_by(country, year, sex) %>%
  summarise(cases = sum(cases)) %>%
  unite(country_sex, country, sex, remove = FALSE) %>%
  ggplot(aes(x = year, y = cases, group = country_sex, colour = sex)) +
  geom_line()

13.2.1 Exercises

1. Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?

You would need the latitude (lat) and longitude (lon) of each origin and destination airport, so you would combine the flights and airports tables, matching the origin and dest variables from flights with the faa variable from airports.

2. I forgot to draw the relationship between weather and airports. What is the relationship and how should it appear in the diagram?

weather and airports can be matched using the variable origin in weather and the variable faa in airports.

3. weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?

If it contained weather records for all airports in the USA, you could also connect weather with flights through dest.

4. We know that some days of the year are “special”, and fewer people than usual fly on them. How might you represent that data as a data frame? What would be the primary keys of that table? How would it connect to the existing tables?

You could create a table with columns for year, month, day, and holiday_name. The primary key would be the unique date (year, month, day combination) of the holiday, and you could connect the table to existing tables (for example, flights and weather) based on the date.
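A sketch of such a table (the holiday dates below are chosen for illustration):

```r
library(tidyverse)
library(nycflights13)

holidays <- tribble(
  ~year, ~month, ~day, ~holiday,
   2013,      7,    4, "Independence Day",
   2013,     11,   28, "Thanksgiving",
   2013,     12,   25, "Christmas Day"
)

# Join on the date to flag flights that departed on a special day
flights %>% left_join(holidays, by = c("year", "month", "day"))
```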

13.3.1 Exercises

1. Add a surrogate key to flights.

flights %>%
  mutate(flight_id = row_number(year)) %>%
  select(flight_id, everything())
## # A tibble: 336,776 x 20
##    flight_id  year month   day dep_time sched_dep_time dep_delay arr_time
##        <int> <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1         1  2013     1     1      517            515        2.      830
##  2         2  2013     1     1      533            529        4.      850
##  3         3  2013     1     1      542            540        2.      923
##  4         4  2013     1     1      544            545       -1.     1004
##  5         5  2013     1     1      554            600       -6.      812
##  6         6  2013     1     1      554            558       -4.      740
##  7         7  2013     1     1      555            600       -5.      913
##  8         8  2013     1     1      557            600       -3.      709
##  9         9  2013     1     1      557            600       -3.      838
## 10        10  2013     1     1      558            600       -2.      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

2. Identify the keys in the following datasets: 1) Lahman::Batting, 2) babynames::babynames, 3) nasaweather::atmos, 4) fueleconomy::vehicles, 5) ggplot2::diamonds. (You might need to install some packages and read some documentation.)

#Lahman::Batting %>% head()
Lahman::Batting %>%
  group_by(playerID, yearID, stint) %>%
  mutate(n = n()) %>%
  filter(n > 1) %>%
  nrow()
## [1] 0

The primary key for Lahman::Batting is playerID, yearID, stint.

#babynames::babynames %>% head()
babynames::babynames %>%
  group_by(year, sex, name) %>%
  mutate(n = n()) %>%
  filter(n > 1) %>%
  nrow()
## [1] 0

The primary key for babynames::babynames is year, sex, name.

#nasaweather::atmos %>% head()
nasaweather::atmos %>%
  group_by(lat, long, year, month) %>%
  mutate(n = n()) %>%
  filter(n > 1) %>%
  nrow()
## [1] 0

The primary key for nasaweather::atmos is lat, long, year, and month.

#fueleconomy::vehicles %>% head()
fueleconomy::vehicles %>%
  group_by(id) %>%
  mutate(n = n()) %>%
  filter(n > 1) %>%
  nrow()
## [1] 0

The primary key for fueleconomy::vehicles is id.

#ggplot2::diamonds %>% head()
ggplot2::diamonds %>%
  distinct() %>%
  nrow()
## [1] 53794
nrow(ggplot2::diamonds)
## [1] 53940

There is no primary key for ggplot2::diamonds. The number of distinct rows in the dataset is less than the total number of rows, suggesting that no combination of variables uniquely identifies the observations.

3. Draw a diagram illustrating the connections between the Batting, Master, and Salaries tables in the Lahman package. Draw another diagram that shows the relationship between Master, Managers, AwardsManagers. How would you characterise the relationship between the Batting, Pitching, and Fielding tables?

library(datamodelr)
dm1 <- dm_from_data_frames(list(Batting = Lahman::Batting,
                                Master = Lahman::Master,
                                Salaries = Lahman::Salaries)) %>%
  dm_set_key("Batting", c("playerID", "yearID", "stint")) %>%
  dm_set_key("Master", "playerID") %>%
  dm_set_key("Salaries", c("yearID", "teamID", "playerID")) %>%
  dm_add_references(
    Batting$playerID == Master$playerID,
    Salaries$playerID == Master$playerID
  )

graph1 <- dm_create_graph(dm1, rankdir = "LR", columnArrows = TRUE)
dm_render_graph(graph1)
library(datamodelr)
dm2 <- dm_from_data_frames(list(Master = Lahman::Master,
                                Managers = Lahman::Managers,
                                AwardsManagers = Lahman::AwardsManagers)) %>%
  dm_set_key("Master", "playerID") %>%
  dm_set_key("Managers", c("yearID", "teamID", "inseason")) %>%
  dm_set_key("AwardsManagers", c("playerID", "awardID", "yearID")) %>%
  dm_add_references(
    Managers$playerID == Master$playerID,
    AwardsManagers$playerID == Master$playerID
  )

graph2 <- dm_create_graph(dm2, rankdir = "LR", columnArrows = TRUE)
dm_render_graph(graph2)

Batting and Master can be matched by playerID. Salaries can then be matched with Batting by playerID and yearID. Master and Managers can be matched by playerID. Managers and AwardsManagers can be matched by playerID and yearID.

Batting, Pitching, and Fielding all have a primary key of playerID, yearID, and stint, and have a one-to-one relationship to each other.

13.4.6 Exercises

1. Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:

airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point() +
    coord_quickmap()

(Don’t worry if you don’t understand what semi_join() does — you’ll learn about it next.) You might want to use the size or colour of the points to display the average delay for each airport.

flights %>% 
  group_by(dest) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  left_join(airports, by = c('dest' = 'faa')) %>%
  ggplot(aes(x = lon, y = lat, size = avg_arr_delay, color = avg_arr_delay)) +
  borders('state') +
  geom_point() +
  coord_quickmap()
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
## Warning: Removed 5 rows containing missing values (geom_point).

2. Add the location of the origin and destination (i.e. the lat and lon) to flights.

flights %>% 
  left_join(airports, by = c('dest' = 'faa')) %>%
  left_join(airports, by = c('origin' = 'faa'), suffix = c('.dest', '.origin')) %>%
  select(dest, origin, lat.dest, lon.dest, lat.origin, lon.origin)
## # A tibble: 336,776 x 6
##    dest  origin lat.dest lon.dest lat.origin lon.origin
##    <chr> <chr>     <dbl>    <dbl>      <dbl>      <dbl>
##  1 IAH   EWR        30.0    -95.3       40.7      -74.2
##  2 IAH   LGA        30.0    -95.3       40.8      -73.9
##  3 MIA   JFK        25.8    -80.3       40.6      -73.8
##  4 BQN   JFK        NA       NA         40.6      -73.8
##  5 ATL   LGA        33.6    -84.4       40.8      -73.9
##  6 ORD   EWR        42.0    -87.9       40.7      -74.2
##  7 FLL   EWR        26.1    -80.2       40.7      -74.2
##  8 IAD   LGA        38.9    -77.5       40.8      -73.9
##  9 MCO   JFK        28.4    -81.3       40.6      -73.8
## 10 ORD   LGA        42.0    -87.9       40.8      -73.9
## # ... with 336,766 more rows

3. Is there a relationship between the age of a plane and its delays?

plane_age <- planes %>%
  mutate(age = 2013 - year) %>%
  select(tailnum, age) %>%
  filter(!is.na(age))

flights %>%
  inner_join(plane_age, by = "tailnum") %>%
  group_by(age) %>%
  filter(!is.na(dep_delay)) %>%
  summarise(delay = mean(dep_delay)) %>%
  ggplot(aes(x = age, y = delay)) +
  geom_point() +
  geom_line()

There is no clear relationship between the age of a plane and its delays. From the plot, delay seems to decrease as the age of the plane increases, but this could be due to other factors as well (for example, older planes may be assigned to routes that are less prone to delay).
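One caveat worth checking (a sketch, reusing the plane_age table defined above): relatively few planes are more than about 25 years old, so the right-hand points of the plot are based on small samples. Capping the age, e.g. with pmin(), makes the bulk of the pattern easier to read:

flights %>%
  inner_join(plane_age, by = "tailnum") %>%
  filter(!is.na(dep_delay)) %>%
  mutate(age = pmin(age, 25)) %>%
  group_by(age) %>%
  summarise(delay = mean(dep_delay)) %>%
  ggplot(aes(x = age, y = delay)) +
  geom_point() +
  geom_line()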

4. What weather conditions make it more likely to see a delay?

flight_weather <- flights %>%
  inner_join(weather, by = c("origin" = "origin",
                             "year" = "year",
                             "month" = "month",
                             "day" = "day",
                             "hour" = "hour"))
flight_weather %>%
  group_by(precip) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = precip, y = delay)) +
  geom_line() + 
  geom_point()

flight_weather %>%
  group_by(wind_speed) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = wind_speed, y = delay)) +
  geom_line() + 
  geom_point()
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).

flight_weather %>%
  group_by(humid) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = humid, y = delay)) +
  geom_line() + 
  geom_point()
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 3 rows containing missing values (geom_point).

Increased wind speed and increased precipitation appear to make it more likely to see a delay.
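Visibility is another natural candidate; a summary in the same style as the plots above would be:

flight_weather %>%
  group_by(visib) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = visib, y = delay)) +
  geom_line() + 
  geom_point()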

5. What happened on June 13 2013? Display the spatial pattern of delays, and then use Google to cross-reference with the weather.

flights %>% 
  filter(year == 2013, month == 6, day == 13) %>%
  group_by(dest) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  left_join(airports, by = c('dest' = 'faa')) %>%
  ggplot(aes(x = lon, y = lat, size = avg_arr_delay, color = avg_arr_delay)) +
  borders('state') +
  geom_point(alpha = .5) +
  scale_color_continuous(low = 'yellow', high = 'red') + 
  coord_quickmap()
## Warning: Removed 7 rows containing missing values (geom_point).

There was a severe storm along the East Coast, resulting in increased delays in those areas.

13.5.1 Exercises

1. What does it mean for a flight to have a missing tailnum? What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)

flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(carrier, sort = TRUE)
## # A tibble: 10 x 2
##    carrier     n
##    <chr>   <int>
##  1 MQ      25397
##  2 AA      22558
##  3 UA       1693
##  4 9E       1044
##  5 B6        830
##  6 US        699
##  7 FL        187
##  8 DL        110
##  9 F9         50
## 10 WN         38

Most of the flights whose tail numbers have no matching record in planes are operated by the carriers “MQ” or “AA” — this is the one variable from the hint. The nycflights13 documentation notes that American (AA) and Envoy Air (MQ) report fleet numbers rather than tail numbers, which explains why their tail numbers don’t match planes.
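To confirm the hint’s ~90% figure from the counts above — MQ and AA together account for (25397 + 22558) / 52606, roughly 91% of the unmatched flights:

flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(carrier, sort = TRUE) %>%
  mutate(p = n / sum(n),
         cum_p = cumsum(p))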

2. Filter flights to only show flights with planes that have flown at least 100 flights.

flights_100 <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(tailnum) %>%
  summarize(n = n()) %>%
  filter(n >= 100)

flights %>%
  semi_join(flights_100, by = 'tailnum')
## # A tibble: 225,766 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      544            545       -1.     1004
##  4  2013     1     1      554            558       -4.      740
##  5  2013     1     1      555            600       -5.      913
##  6  2013     1     1      557            600       -3.      709
##  7  2013     1     1      557            600       -3.      838
##  8  2013     1     1      558            600       -2.      849
##  9  2013     1     1      558            600       -2.      853
## 10  2013     1     1      558            600       -2.      923
## # ... with 225,756 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

3. Combine fueleconomy::vehicles and fueleconomy::common to find only the records for the most common models.

fueleconomy::vehicles %>%
  semi_join(fueleconomy::common, by = c('make', 'model'))
## # A tibble: 14,531 x 12
##       id make  model  year class trans drive   cyl displ fuel    hwy   cty
##    <int> <chr> <chr> <int> <chr> <chr> <chr> <int> <dbl> <chr> <int> <int>
##  1  1833 Acura Inte…  1986 Subc… Auto… Fron…     4  1.60 Regu…    28    22
##  2  1834 Acura Inte…  1986 Subc… Manu… Fron…     4  1.60 Regu…    28    23
##  3  3037 Acura Inte…  1987 Subc… Auto… Fron…     4  1.60 Regu…    28    22
##  4  3038 Acura Inte…  1987 Subc… Manu… Fron…     4  1.60 Regu…    28    23
##  5  4183 Acura Inte…  1988 Subc… Auto… Fron…     4  1.60 Regu…    27    22
##  6  4184 Acura Inte…  1988 Subc… Manu… Fron…     4  1.60 Regu…    28    23
##  7  5303 Acura Inte…  1989 Subc… Auto… Fron…     4  1.60 Regu…    27    22
##  8  5304 Acura Inte…  1989 Subc… Manu… Fron…     4  1.60 Regu…    28    23
##  9  6442 Acura Inte…  1990 Subc… Auto… Fron…     4  1.80 Regu…    24    20
## 10  6443 Acura Inte…  1990 Subc… Manu… Fron…     4  1.80 Regu…    26    21
## # ... with 14,521 more rows

4. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

flights_48 <- flights %>%
  group_by(year, month, day) %>%
  summarise(delay_24h = sum(dep_delay, na.rm = TRUE) + sum(arr_delay, na.rm = TRUE)) %>%
  mutate(delay_48h = delay_24h + lag(delay_24h)) %>%
  filter(!(is.na(delay_48h))) %>%
  arrange(desc(delay_48h))

weather_48 <- weather %>%
  group_by(year, month, day) %>%
  summarize_at(vars(humid, precip, temp, visib), mean, na.rm = TRUE)

flights_48 %>%
  left_join(weather_48) %>% 
  head(10) 
## Joining, by = c("year", "month", "day")
## # A tibble: 10 x 9
## # Groups:   year, month [6]
##     year month   day delay_24h delay_48h humid   precip  temp visib
##    <dbl> <dbl> <int>     <dbl>     <dbl> <dbl>    <dbl> <dbl> <dbl>
##  1 2013.    7.    23    80641.   175419.  74.6 0.0154    80.6  9.33
##  2 2013.    3.     8   135264.   167530.  80.4 0.0192    35.5  4.91
##  3 2013.    6.    25    80434.   166649.  60.3 0.00153   81.9  9.97
##  4 2013.    8.     9    72866.   165287.  82.8 0.00153   78.5  9.08
##  5 2013.    6.    28    81389.   157910.  73.1 0.        78.0  9.82
##  6 2013.    7.    10    97120.   157396.  73.2 0.00111   80.6  9.24
##  7 2013.    4.    19    82556.   150125.  86.1 0.000833  61.3  5.25
##  8 2013.    3.     9     3243.   138507.  46.4 0.        43.5 10.0 
##  9 2013.    5.    24    51662.   137740.  84.1 0.0106    57.3  8.84
## 10 2013.    6.    14    39422.   136526.  70.6 0.0100    62.5  9.44
flights_48 %>%
  left_join(weather_48) %>% 
  tail(10) 
## Joining, by = c("year", "month", "day")
## # A tibble: 10 x 9
## # Groups:   year, month [3]
##     year month   day delay_24h delay_48h humid   precip  temp visib
##    <dbl> <dbl> <int>     <dbl>     <dbl> <dbl>    <dbl> <dbl> <dbl>
##  1 2013.    9.    10    -5264.   -16039.  73.0 0.        76.3  9.79
##  2 2013.    8.    27    -8109.   -16654.  66.3 0.000833  78.9  9.99
##  3 2013.    8.    26    -8545.   -16884.  59.1 0.000417  75.5 10.0 
##  4 2013.    9.     9   -10775.   -19308.  55.0 0.        65.2 10.0 
##  5 2013.    9.    18    -9508.   -19428.  55.4 0.        61.2 10.0 
##  6 2013.    9.     8    -8533.   -22665.  51.2 0.        73.3 10.0 
##  7 2013.    9.     5   -15387.   -28961.  48.8 0.        71.9 10.0 
##  8 2013.    9.     7   -14132.   -31731.  50.5 0.        68.7 10.0 
##  9 2013.   10.     2   -13623.   -31919.  58.9 0.        73.3  9.97
## 10 2013.    9.     6   -17599.   -32986.  45.5 0.        64.1 10.0

For the ten worst 48-hour windows, humidity and precipitation tend to be higher and visibility somewhat lower than for the windows with the smallest (most negative) total delays, nearly all of which have zero precipitation and near-perfect visibility.

5. What does anti_join(flights, airports, by = c("dest" = "faa")) tell you? What does anti_join(airports, flights, by = c("faa" = "dest")) tell you?

anti_join(flights, airports, by = c("dest" = "faa"))
## # A tibble: 7,602 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      544            545       -1.     1004
##  2  2013     1     1      615            615        0.     1039
##  3  2013     1     1      628            630       -2.     1137
##  4  2013     1     1      701            700        1.     1123
##  5  2013     1     1      711            715       -4.     1151
##  6  2013     1     1      820            820        0.     1254
##  7  2013     1     1      820            820        0.     1249
##  8  2013     1     1      840            845       -5.     1311
##  9  2013     1     1      909            810       59.     1331
## 10  2013     1     1      913            918       -5.     1346
## # ... with 7,592 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

This gives the flights whose destination airport does not appear in the airports dataset.
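To see which destinations these are — they turn out to be airports in Puerto Rico and the US Virgin Islands (BQN, SJU, STT, PSE), which the FAA-based airports table omits:

anti_join(flights, airports, by = c("dest" = "faa")) %>%
  count(dest, sort = TRUE)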

anti_join(airports, flights, by = c("faa" = "dest"))
## # A tibble: 1,357 x 8
##    faa   name                   lat    lon   alt    tz dst   tzone        
##    <chr> <chr>                <dbl>  <dbl> <int> <dbl> <chr> <chr>        
##  1 04G   Lansdowne Airport     41.1  -80.6  1044   -5. A     America/New_…
##  2 06A   Moton Field Municip…  32.5  -85.7   264   -6. A     America/Chic…
##  3 06C   Schaumburg Regional   42.0  -88.1   801   -6. A     America/Chic…
##  4 06N   Randall Airport       41.4  -74.4   523   -5. A     America/New_…
##  5 09J   Jekyll Island Airpo…  31.1  -81.4    11   -5. A     America/New_…
##  6 0A9   Elizabethton Munici…  36.4  -82.2  1593   -5. A     America/New_…
##  7 0G6   Williams County Air…  41.5  -84.5   730   -5. A     America/New_…
##  8 0G7   Finger Lakes Region…  42.9  -76.8   492   -5. A     America/New_…
##  9 0P2   Shoestring Aviation…  39.8  -76.6  1000   -5. U     America/New_…
## 10 0S9   Jefferson County In…  48.1 -123.    108   -8. A     America/Los_…
## # ... with 1,347 more rows

This gives the airports that are not the destination of any flight in the flights dataset — i.e. airports with no incoming flights from New York City in 2013.

6. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.

flights %>%
  select(carrier, tailnum) %>%
  group_by(tailnum) %>%
  summarize(n = length(unique(carrier))) %>%
  filter(n > 1)
## # A tibble: 18 x 2
##    tailnum     n
##    <chr>   <int>
##  1 N146PQ      2
##  2 N153PQ      2
##  3 N176PQ      2
##  4 N181PQ      2
##  5 N197PQ      2
##  6 N200PQ      2
##  7 N228PQ      2
##  8 N232PQ      2
##  9 N933AT      2
## 10 N935AT      2
## 11 N977AT      2
## 12 N978AT      2
## 13 N979AT      2
## 14 N981AT      2
## 15 N989AT      2
## 16 N990AT      2
## 17 N994AT      2
## 18 <NA>        7

Hypothesis rejected — some planes are flown by more than one airline.
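To see which pairs of carriers shared a plane (a sketch using distinct() rather than a join):

flights %>%
  filter(!is.na(tailnum)) %>%
  distinct(carrier, tailnum) %>%
  group_by(tailnum) %>%
  filter(n() > 1) %>%
  summarize(carriers = paste(sort(carrier), collapse = " / "))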

15.3.1 Exercises

1. Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?

ggplot(gss_cat, aes(rincome)) +
  geom_bar()

The default bar chart is hard to understand since the labels on the x axis are too crowded. This can be improved by flipping the x and y axes using coord_flip().

ggplot(gss_cat, aes(rincome)) +
  geom_bar() + 
  coord_flip()

2. What is the most common relig in this survey? What’s the most common partyid?

gss_cat %>%
  count(relig) %>%
  arrange(desc(n)) %>%
  head(1)
## # A tibble: 1 x 2
##   relig          n
##   <fct>      <int>
## 1 Protestant 10846

The most common relig in the survey is Protestant.

gss_cat %>%
  count(partyid) %>%
  arrange(desc(n)) %>%
  head(1)
## # A tibble: 1 x 2
##   partyid         n
##   <fct>       <int>
## 1 Independent  4119

The most common partyid in the survey is Independent.

3. Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?

denom applies only to relig == "Protestant".

With a table:

gss_cat %>%
  count(relig, denom) %>%
   filter(!denom %in% c("No answer", "Other", "Don't know", "Not applicable", "No denomination")) 
## # A tibble: 25 x 3
##    relig      denom                    n
##    <fct>      <fct>                <int>
##  1 Protestant Episcopal              397
##  2 Protestant Presbyterian-dk wh     244
##  3 Protestant Presbyterian, merged    67
##  4 Protestant Other presbyterian      47
##  5 Protestant United pres ch in us   110
##  6 Protestant Presbyterian c in us   104
##  7 Protestant Lutheran-dk which      267
##  8 Protestant Evangelical luth       122
##  9 Protestant Other lutheran          30
## 10 Protestant Wi evan luth synod      71
## # ... with 15 more rows

With a visualisation:

gss_cat %>%
  count(relig, denom) %>%
  ggplot(aes(relig, denom)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90))

15.4.1 Exercises

1. There are some suspiciously high numbers in tvhours. Is the mean a good summary?

summary(gss_cat[["tvhours"]])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.000   2.000   2.981   4.000  24.000   10146

No; the distribution is right-skewed by the suspiciously high values (up to 24 hours a day), which pull the mean upward, so the median would provide a better summary.

2. For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.

fct_gss <- gss_cat[sapply(gss_cat, is.factor)]
lapply(fct_gss, levels)
## $marital
## [1] "No answer"     "Never married" "Separated"     "Divorced"     
## [5] "Widowed"       "Married"      
## 
## $race
## [1] "Other"          "Black"          "White"          "Not applicable"
## 
## $rincome
##  [1] "No answer"      "Don't know"     "Refused"        "$25000 or more"
##  [5] "$20000 - 24999" "$15000 - 19999" "$10000 - 14999" "$8000 to 9999" 
##  [9] "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
## [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"
## 
## $partyid
##  [1] "No answer"          "Don't know"         "Other party"       
##  [4] "Strong republican"  "Not str republican" "Ind,near rep"      
##  [7] "Independent"        "Ind,near dem"       "Not str democrat"  
## [10] "Strong democrat"   
## 
## $relig
##  [1] "No answer"               "Don't know"             
##  [3] "Inter-nondenominational" "Native american"        
##  [5] "Christian"               "Orthodox-christian"     
##  [7] "Moslem/islam"            "Other eastern"          
##  [9] "Hinduism"                "Buddhism"               
## [11] "Other"                   "None"                   
## [13] "Jewish"                  "Catholic"               
## [15] "Protestant"              "Not applicable"         
## 
## $denom
##  [1] "No answer"            "Don't know"           "No denomination"     
##  [4] "Other"                "Episcopal"            "Presbyterian-dk wh"  
##  [7] "Presbyterian, merged" "Other presbyterian"   "United pres ch in us"
## [10] "Presbyterian c in us" "Lutheran-dk which"    "Evangelical luth"    
## [13] "Other lutheran"       "Wi evan luth synod"   "Lutheran-mo synod"   
## [16] "Luth ch in america"   "Am lutheran"          "Methodist-dk which"  
## [19] "Other methodist"      "United methodist"     "Afr meth ep zion"    
## [22] "Afr meth episcopal"   "Baptist-dk which"     "Other baptists"      
## [25] "Southern baptist"     "Nat bapt conv usa"    "Nat bapt conv of am" 
## [28] "Am bapt ch in usa"    "Am baptist asso"      "Not applicable"

For marital, race, relig, and denom, the order of the levels is arbitrary. For rincome, the order of the levels is principled (income bands in decreasing order), and partyid is also arguably principled, running from “Strong republican” through “Independent” to “Strong democrat”.

3. Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?

Moving “Not applicable” to the front of the levels gives it an integer value of 1, and ggplot2 draws factor levels starting from the origin of the axis, so level 1 ends up at the bottom of the plot.
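A minimal illustration with a toy factor — fct_relevel() moves the named level to position 1:

f <- factor(c("a", "b", "na"))
levels(fct_relevel(f, "na"))
## [1] "na" "a"  "b"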

15.5.1 Exercises

1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?

levels(gss_cat$partyid)
##  [1] "No answer"          "Don't know"         "Other party"       
##  [4] "Strong republican"  "Not str republican" "Ind,near rep"      
##  [7] "Independent"        "Ind,near dem"       "Not str democrat"  
## [10] "Strong democrat"
gss_cat %>%
  mutate(partyid = fct_collapse(partyid, 
  Democrat = c("Not str democrat", "Strong democrat"),
  Republican = c("Strong republican", "Not str republican"),
  Independent = c("Ind,near rep", "Independent", "Ind,near dem"),
  Other = c("No answer", "Don't know", "Other party"))) %>%
  count(year, partyid)  %>%
  group_by(year) %>%
  mutate(p = n / sum(n)) %>%
  ggplot(aes(year, p, group = partyid, color = partyid)) +
    geom_point() +
    geom_line()

The proportion of Independents is increasing, the proportion of Democrats has fluctuated but overall remains steady, and the proportion of Republicans is decreasing.

2. How could you collapse rincome into a small set of categories?

gss_cat %>%
  mutate(rincome = fct_collapse(rincome, `Unknown` = c("No answer", "Don't know", "Refused", "Not applicable"),
                                `Lt $5000` = c("Lt $1000", str_c("$", c("1000", "3000", "4000"), " to ", c("2999", "3999", "4999"))),
                                `$5000 to 10000` = str_c("$", c("5000", "6000", "7000", "8000"), " to ", c("5999", "6999", "7999", "9999")))) %>%
  ggplot(aes(rincome)) +
  geom_bar() +
  coord_flip()
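An alternative — assuming the goal is simply fewer categories rather than these specific cut points — is fct_lump(), which lumps the least frequent levels into an “Other” category:

gss_cat %>%
  mutate(rincome = fct_lump(rincome, n = 6)) %>%
  ggplot(aes(rincome)) +
  geom_bar() +
  coord_flip()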